history: per-message cost, tokens, and latency tracking#16
Conversation
|
please don't review the code yet – I'm looking into a better way to get pricing details... |
|
This is addressing #15 |
0de06df to
9933121
Compare
|
@ankrgyl this is ready to review, added tests |
|
@ankrgyl this is ready to review, added tests |
Alexsun1one
left a comment
There was a problem hiding this comment.
two small things on the price lookup side, otherwise this looks great. the LiteLLM-as-source-of-truth + per-provider accounting story is the right call, and i like that with_pricing keeps embedders/air-gapped runs honest. inline notes below.
| } | ||
| self.entries | ||
| .iter() | ||
| .filter(|(key, _)| model.starts_with(key.as_str())) |
There was a problem hiding this comment.
prefix-match without a separator boundary can fall through in surprising ways. e.g. gpt-4o-mini matches gpt-4o (intended) but also matches gpt-4 (not intended), since "gpt-4o".starts_with("gpt-4") is true; if the more specific entry is missing the longest-prefix winner is still wrong.
if the goal is the claude-sonnet-4-6-20251022 -> claude-sonnet-4-6 case, tightening the filter to require model[key.len()..] be empty or start with - / : keeps that behavior and stops gpt-4o from sliding into gpt-4 when an entry is absent.
| // two when cached==0 (the typical case). | ||
| match provider { | ||
| Some(p) if p.starts_with("anthropic") => Self::Additive, | ||
| Some(p) if p.starts_with("bedrock") => Self::Additive, |
There was a problem hiding this comment.
this catches bedrock_converse correctly but also catches plain bedrock provider entries, which on Bedrock covers Mistral, Cohere, Meta Llama, AI21, and friends. those follow OpenAI-style inclusive prompt_tokens, not Anthropic-style additive.
the real-world impact is small today because cache_read on non-Anthropic Bedrock models is uncommon. but if/when those providers add caching, the formula will over-count fresh-input tokens.
probably starts_with("bedrock_converse") only, or an explicit allow-list of bedrock-anthropic provider strings.
|
|
||
| async fn try_load() -> anyhow::Result<PricingTable> { | ||
| // 1. Local path override — used by tests and air-gapped setups. | ||
| if let Ok(path) = std::env::var("EXO_LITELLM_PRICES_PATH") { |
There was a problem hiding this comment.
i prefer all env vars to be parsed through clap, so that someone could propagate the pricing table as a CLI arg too. it also forces all functions (like this one) to be relatively pure
maybe we add this to a skill in the repo somewhere?
a0092fc to
74c28a7
Compare
Add an optional UsageRecord to every EventData::Messages event: model id, raw token counts (prompt / completion / cached / cache-creation / reasoning), USD cost, and TTFT + wall-clock duration. Fields are Option + skip_serializing_if and the record is boxed; legacy events still parse. Cost is policy, computed in userspace, never by the trusted substrate: - crates/cost: a standalone library with the price-table data model, a self-contained LiteLLM loader (explicit path/url, on-disk cache, degrade-to-empty), and per-provider math. Lookup is boundary-aware so dated revisions resolve without sliding a model onto a shorter neighbor's rate. Anthropic-family bills additively; everything else (including Bedrock, a TODO) is inclusive. - exoharness stays minimal: it holds the UsageRecord schema and persists it verbatim, with no pricing code or dependency. - Basic executor fills cost from a table loaded once at startup and injected via the CLI (--pricing-path / --pricing-url, env as fallback). - The TypeScript harness (exoclaw) has its own self-contained cost port (@exo/model-runtime/cost) that owns its data loading (env override, own cache, own fetch) through the harness's normal config flow, so per-message cost works there with no dependency on the Rust loader or the trusted layer. RLM is left unwired for now: its multi-call turn has different per-message accounting and is a separate follow-up. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Write up the cost-tracking design: cost as a userspace policy library (self-contained per language), the minimal substrate, per-provider math, boundary-aware lookup, the loader, and the trust framing (usage is agent-reported telemetry). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
74c28a7 to
dd64f38
Compare
|
rewriting in #54 |
Summary
Adds an optional
UsageRecordto everyEventData::Messagesevent so we have a durable, per-message record of:server_duration_ms— reserved (lingua does not yet surface a provider-reported processing time)Pricing: why runtime LiteLLM JSON, not a hardcoded table
OpenAI and Anthropic standard APIs return tokens but not USD in their per-call responses. (OpenAI never; Anthropic only in a separate aggregate Admin API.) So cost must always be computed downstream.
Initial version hardcoded a small price table in Rust. That was a mistake — by the time I wrote the first commit, my table already had Claude Opus 4.7 at $15/$75 per MTok when the real price had dropped to $5/$25. Hand-maintained tables drift.
This version loads LiteLLM's pricing database (2,739 model entries, community-maintained, covers all major providers + Bedrock/Azure/Vertex regional variants):
Cached-vs-fresh: per-provider accounting matters
Different providers report cached tokens with different conventions, and getting this wrong distorts cost by up to ~10× on cache-heavy requests:
anthropic,bedrock_converse,vertex_ai-anthropic_models,azure_ai):prompt_tokensis fresh input only.cache_readandcache_creationare separate. Bill all three additively.openai,mistral, etc.):prompt_tokensis total (including cached). Cached is a subset. Must subtract cached from prompt before billing fresh-input rate.The first commit got OpenAI wrong (used the additive formula universally). This version branches on LiteLLM's `litellm_provider` field and applies the correct formula. Both formulas have dedicated unit tests with realistic token mixes.
Architecture
`exoharness::pricing` (pure data + math, no network, stays wasm-compatible):
`executor::pricing_loader` (network layer, gated by `tokio::sync::OnceCell`):
`BasicExecutor::with_pricing` + `BasicHarness::with_pricing_table`: explicit-table constructors. Bypass the loader, useful for tests/embedders/air-gapped deployments.
What's in the event JSON now
```json
{
"type": "messages",
"messages": [...],
"response_id": "01J...",
"usage": {
"model": "claude-sonnet-4-6",
"prompt_tokens": 2847,
"completion_tokens": 412,
"prompt_cached_tokens": 12500,
"cost_usd": 0.0146985,
"ttft_ms": 842,
"duration_ms": 3210
}
}
```
All `usage` sub-fields are `Option` + `skip_serializing_if`. Legacy events with no `usage` key continue to deserialize.
Test plan
Not in scope (intentional)
🤖 Generated with Claude Code